HW 01

Author

Nathan Herling

0 - Setup

The packages loaded:
- tidyverse
- glue
- scales
- lubridate
- patchwork
- ggh4x
- ggrepel
- openintro

1 - Road traffic accidents in Edinburgh

This plot visualizes the distribution of road accidents at different times of day, separated into weekdays and weekends. The data is faceted by whether the day is a weekend, making it easy to compare patterns across these two categories. Accident severity is color-coded to distinguish between fatal, serious, and slight incidents. The plot highlights time-based trends that may help identify peak periods of high-risk activity. This information could support efforts to improve traffic safety or allocate emergency response resources more effectively.

One striking trend is the high number of fatalities on weekdays, in contrast to the weekend plot, which shows no visible fatalities.
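
A minimal sketch of how a weekday/weekend split like the one described could be derived and plotted. The tibble `accidents_demo` and its columns `time`, `day_of_week`, and `severity` are hypothetical stand-ins for illustration, not the data used for the figure.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data; column names and values are assumptions.
accidents_demo <- tibble(
  time        = c(8.5, 17.25, 23.0, 13.5, 2.0, 9.0),  # hour of day as a decimal
  day_of_week = c("Monday", "Friday", "Saturday", "Sunday", "Sunday", "Wednesday"),
  severity    = c("Slight", "Serious", "Slight", "Slight", "Serious", "Fatal")
)

# Label each record as Weekday or Weekend so the plot can be faceted on it.
accidents_demo <- accidents_demo |>
  mutate(day_type = if_else(day_of_week %in% c("Saturday", "Sunday"),
                            "Weekend", "Weekday"))

# One density panel per day type, filled by severity.
p_accidents <- ggplot(accidents_demo, aes(x = time, fill = severity)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~day_type, ncol = 1) +
  labs(x = "Time of day (hour)", y = "Density", fill = "Severity")
```

With real data, each severity level would need enough observations per panel for a density estimate to be meaningful.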

2 - NYC marathon winners

Question 2a
The histogram highlights a bimodal distribution, revealing differences in marathon times between men and women. It effectively shows the shape of the data and how values cluster across different ranges. In contrast, the box plot does not capture modality but provides a summary of central tendency, variability, and outliers. While histograms emphasize distribution shape, box plots offer a concise overview of data spread.

Question 2b
Box plots are a great way to visualize distributions. Here, the men’s and women’s time distributions overlap only slightly, and only when some of the men’s outliers are considered. Plotting either group alone would not reveal the bimodal distribution; plotting them together shows two distinct, largely separated groups.

Question 2c
In many cases, redundancy is subjective: what one person considers ‘redundancy’ another might see as a ‘feature’. Here, we can consolidate the two distributions onto a single plot, since they largely do not overlap. This eliminates the need to label both graphs separately. Furthermore, because viewers readily perceive relative color differences, one of the groups can be changed to gray [cornsilk4]. Keeping the data the same while reducing the amount of ink increases the data-to-ink ratio, which is the desired effect.
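
A rough sketch of the consolidation idea, assuming a made-up data frame `marathon_demo` with hypothetical columns `division` and `time_hrs`; the second fill color (`steelblue`) is also an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical finishing times in hours; not the real winners data.
marathon_demo <- data.frame(
  division = rep(c("Men", "Women"), each = 5),
  time_hrs = c(2.13, 2.15, 2.18, 2.20, 2.35, 2.40, 2.45, 2.47, 2.52, 2.90)
)

# Both groups share one panel; the legend is dropped because the
# y-axis labels already name the groups (less ink, same data).
p_marathon <- ggplot(marathon_demo,
                     aes(x = time_hrs, y = division, fill = division)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c(Men = "cornsilk4", Women = "steelblue")) +
  labs(x = "Finishing time (hours)", y = NULL)
```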

Question 2d

This plot allows us to clearly see how tightly the data is grouped around the mean for both categories. While the histogram also conveys this information, this visual representation presents it in a different modality. The bimodal distribution of the two data categories is clearly visible, along with the minimal overlap between the datasets. Additionally, both datasets exhibit a similar decaying exponential shape from left to right, which was not apparent in the other data plots.

3 - US counties

Question 3a
This code attempts to overlay two plots that use different variables on the same axes.

ggplot(county) +
  geom_point(aes(x = median_edu, y = median_hh_income)) +
  geom_boxplot(aes(x = smoking_ban, y = pop2017))

The first layer - creates a scatter plot (geom_point) of median_edu vs. median_hh_income.

ggplot(county) +
  geom_point(aes(x = median_edu, y = median_hh_income))

The second layer - creates a boxplot with smoking_ban on the x-axis and pop2017 on the y-axis.

ggplot(county) +
  geom_boxplot(aes(x = smoking_ban, y = pop2017))
  

Both geom_point and geom_boxplot are layered in the same ggplot, but they map different variables to x and y. On its own, each layer would produce a meaningful plot; combined, they produce a confusing and misleading visualization, a kind of visual cacophony.

Technically, the code may run without error, but it doesn’t “work” from a data visualization standpoint. Layering geoms that map unrelated variables onto shared x- and y-axes, without coordinating scales or structure, leads to a plot that is hard to interpret and potentially misleading.
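
One possible alternative, sketched under assumptions: build each layer as its own plot and place the two side by side with patchwork (one of the packages loaded in Setup). The data frame `county_demo` is a hypothetical stand-in with made-up values, so the block runs without the real county data.

```r
library(ggplot2)
library(patchwork)

# Hypothetical stand-in for the county data; values are made up.
county_demo <- data.frame(
  median_edu       = c("hs_diploma", "some_college", "some_college", "bachelors"),
  median_hh_income = c(41000, 52000, 58000, 83000),
  smoking_ban      = c("none", "partial", "none", "comprehensive"),
  pop2017          = c(12000, 85000, 40000, 910000)
)

# Each geom gets its own plot, so each keeps its own axes.
p_points <- ggplot(county_demo) +
  geom_point(aes(x = median_edu, y = median_hh_income))

p_box <- ggplot(county_demo) +
  geom_boxplot(aes(x = smoking_ban, y = pop2017))

# patchwork's `+` places the two plots next to each other.
p_side_by_side <- p_points + p_box
```

Keeping the plots separate means each axis describes one variable only, which avoids the shared-axis confusion described above.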

Question 3b
We are to compare two graphs, each showing the same data but presented differently.
[Figures: two county facet plots]

An obvious answer is the left graph: plotting the data horizontally has more visual impact. Yet the way the data is grouped in the vertical (right) graph shows some skewing (topographical compression) of the data geometry seen in the left graph.

Therefore, from these two graphs alone we cannot conclude that, ‘in general’, plotting data such as this horizontally yields better visual results. For a fair comparison, the vertical graph would need to be stretched to the same dimensions/scale as the horizontal graph.

Question 3c

Note(s):

  • I left these warnings in - as they seem relevant.
  • If I had more time, I’d work on cleaning the data.
  • I did not have enough time to adequately solve the size rendering problem here.
  • Use ‘zoom’ when needed (ha).
  • I kept the ‘minimal’ theme on - so the background grey shading isn’t present on my graphs.
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
Removed 3 rows containing missing values or values outside the
scale range (`geom_point()`).
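
The warnings above indicate that rows with missing values were dropped at plot time. A minimal sketch of one way to clean such rows beforehand, assuming a hypothetical tibble `county_na_demo` whose column names are illustrative and may not match the variables actually plotted:

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values in the plotted columns.
county_na_demo <- tibble(
  homeownership = c(70.1, NA, 65.3, 80.2),
  poverty       = c(12.5, 18.0, NA, 9.4),
  metro         = c("yes", "no", "no", NA)
)

# Drop rows with NA only in the columns mapped to x and y,
# so rows missing unrelated fields are kept.
county_clean <- county_na_demo |>
  drop_na(homeownership, poverty)

# Number of rows that would otherwise be removed silently by the geoms.
n_dropped <- nrow(county_na_demo) - nrow(county_clean)
```

Passing `na.rm = TRUE` to `geom_point()` is another option; it silences the warning but does not report how many rows were dropped.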

4 - Rental apartments in SF

Question 4a
Describe the relationship between income and credit card balance. Touch on how/if the relationship varies based on the four (4) category combinations.
The relationship between income and credit card balance is clearly increasing: higher income tends to correspond with higher balances. This positive trend holds across all four categories examined. Notably, the slope of the trend line is nearly identical for non-married students and non-married non-students, and a similar pattern holds for married students and married non-students.

However, in all four categories the observations are concentrated toward the lower end of the income scale.

This suggests that income is the dominant factor in credit card balance, with student and marital status having little effect on the strength of that relationship.

Question 4b
Based on your answer to part (a), do you think married and student might be useful predictors, in addition to income for predicting credit card balance? Explain your reasoning.
Yes, with some caveats. There are a few outliers with zero balances that should be taken into account. Additionally, the distribution of the data is skewed: most observations fall below an income of $75K, and within this range there is higher variance in credit card balances. Overall, the relationship between income and balance appears to be roughly linear, suggesting that income is a useful predictor. Finally, these observations apply to all four categories examined, not just married students.

Question 4c

Question 4d
Based on the plot from part (c), how, if at all, are the relationships between income and credit utilization different than the relationships between income and credit balance for individuals with various student and marriage status.
The relationships between income and credit utilization differ more distinctly across the four categories than the uniformly positive trends observed in part (a) for credit balance. Specifically, there is a positive relationship between income and credit utilization for both married and non-married non-students. In contrast, non-married students show a strong negative relationship, which may partly be influenced by the geometry or distribution of the dataset. For married students, the relationship is only slightly negative. Overall, the patterns in credit utilization are not as uniformly positive as those in credit balance, and they vary more noticeably by student and marital status.
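
A small sketch of how balance and utilization trends can diverge within a group. It assumes utilization is defined as balance divided by credit limit, and uses a made-up tibble `credit_demo` with hypothetical columns; none of these values come from the actual dataset.

```r
library(dplyr)

# Hypothetical credit records; columns and values are made up.
credit_demo <- tibble(
  income  = c(20, 40, 60, 80, 20, 40, 60, 80),        # in $1000s
  balance = c(200, 400, 600, 800, 500, 600, 700, 800),
  limit   = c(2000, 4000, 6000, 8000, 2000, 4000, 6000, 8000),
  student = rep(c("No", "Yes"), each = 4)
)

# ASSUMPTION: utilization = balance / limit.
# Fit a simple linear trend per group for each response and keep the slope.
slopes <- credit_demo |>
  mutate(utilization = balance / limit) |>
  group_by(student) |>
  summarise(
    slope_balance     = coef(lm(balance ~ income))[["income"]],
    slope_utilization = coef(lm(utilization ~ income))[["income"]]
  )
```

In this made-up data, balance rises with income in both groups, while utilization falls with income for the student group because the limit grows faster than the balance.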

5 - Napoleon’s march.

Question 5
Part a
Here are a couple of sites that helped me with the code and understanding of the figure.

Part b
I added extra code comments with ‘#<—–’ in the code for this problem, to add extra description(s)/proof of knowledge of code functionality.
Part c
For my individualization, I forced the city names not to overlap and changed their color to a hue that is legible on both white and brown backgrounds. [#00CED1]
Preventing the overlap initially meant that not all of the names fit in the graph window. This was solved with a call to: ggrepel::geom_text_repel(..)
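
A minimal sketch of the label-repelling approach, assuming a hypothetical data frame `cities_demo`; the coordinates are made up and the `max.overlaps = Inf` setting is an illustrative choice, not necessarily what the original code used.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical city positions; coordinates are made up for illustration.
cities_demo <- data.frame(
  long = c(24.0, 25.3, 26.7, 37.6),
  lat  = c(55.0, 54.7, 54.3, 55.8),
  city = c("Kowno", "Wilna", "Smorgoni", "Moscou")
)

# geom_text_repel() nudges labels apart so they do not overlap;
# max.overlaps = Inf asks it to keep every label rather than drop crowded ones.
p_cities <- ggplot(cities_demo, aes(x = long, y = lat)) +
  geom_point() +
  geom_text_repel(aes(label = city), color = "#00CED1", max.overlaps = Inf)
```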